ABSTRACT

Many large-scale Web applications that require ranked top-k
retrieval, such as Web search and online advertising, are implemented
using inverted indices. An inverted index represents a sparse term- 
document matrix, where non-zero elements indicate the strength of 
term-document association. In this work, we present an approach 
for lossless compression of inverted indices. Our approach maps 
terms in a document corpus to a new term space in order to re- 
duce the number of non-zero elements in the term-document ma- 
trix, resulting in a more compact inverted index. We formulate the 
problem of selecting a new term space that minimizes the resulting 
index size as a matrix factorization problem, and prove that find- 
ing the optimal factorization is an NP-hard problem. We develop a 
greedy algorithm for finding an approximate solution. 

A side effect of our approach is increasing the number of terms 
in the index, which may negatively affect query evaluation
performance. To eliminate this effect, we develop a methodology for
modifying query evaluation algorithms by exploiting specific prop- 
erties of our compression approach. 

Our experimental evaluation demonstrates that our approach 
achieves an index size reduction of 20%, while maintaining the 
same query response times. Higher compression ratios up to 35% 
are achievable, however at the cost of slightly longer query re- 
sponse times. Furthermore, combining our approach with other 
lossless compression techniques, namely variable-byte encoding, 
leads to index size reduction of up to 50%. 

1. INTRODUCTION 

Web search engines and other large-scale information retrieval 
(IR) systems typically have to process query workloads of thou- 
sands of requests per second over large collections of documents. 
Usually, the result of the retrieval is a ranked list of the top few (k)
results. Top-k evaluation of textual queries is used in a large num- 
ber of Web applications such as search, textual advertising, and 
product recommendation. 

Top-k retrieval can be defined as follows. Given a query Q and a
document corpus Docs, find the k documents
$\{D_1, D_2, \ldots, D_k\} \subseteq Docs$ that have the highest score,
according to some scoring function Score(D, Q). Both the query and the
documents are sets of terms from the same high-dimensional space.
Scoring is usually performed based on the overlapping terms between
the query and the document (i.e., the intersection between the
document and the query terms). The document corpus Docs can be
represented as a two-dimensional matrix, denoted as V, with m terms
and n documents (Figure 1). In general, the values of elements in V
measure how strongly the terms are associated with the corresponding
documents. For example, one measure is the term frequency, which is
the number of occurrences of a term in a document. Given the example
query Q in Figure 1, the shaded portion of the matrix is used for its
evaluation.

Figure 1: Document corpus as a matrix V

Inverted indices are the prevailing implementation of scalable 
top-k retrieval. In an inverted index, each term T appearing in the
corpus Docs is associated with a posting list, which enumerates the 
documents that contain T. An inverted index is a sparse represen- 
tation of the matrix V that stores only non-zero matrix elements. 

In several applications, top-k queries are processed while the
user is waiting for the reply, which imposes very strict bounds on 
query latency. Due to such requirements, memory-resident indices 
are becoming more popular in current search engines. To lower 
the amount of required memory, and hence the system cost,
compression techniques (e.g., [5, 6, 22, 25, 26, 27]) are heavily used
to reduce the size of the inverted indices. Compression techniques are
mainly divided into two categories: lossless compression, where the
quality of results is not affected by the compression, and lossy
compression, where result quality might be affected. Lossy techniques
typically trade index size for retrieval accuracy [5, 6], while
lossless techniques exploit the properties of the document corpus for
compactly encoding information in individual posting lists, such as
document identifiers [27, 26] and term positions [22, 25].

In this paper, we propose a novel lossless compression technique
that holistically compresses multiple posting lists by taking
advantage of similarities between them. This type of compression can
be applied before the standard per-posting-list compression in order
to combine the benefits of both methods. In spirit, the technique
presented in this paper is related to matrix factorization methods
such as Non-negative Matrix Factorization (NMF) [13, 16], Latent
Dirichlet Allocation [3], Singular Value Decomposition [23], and
Principal Component Analysis [14], among others. All of these
techniques map the documents from the space determined by the
original terms into a lower-dimensional space. However, unlike
previous factorization methods, we aim at providing an exact
factorization of the input matrix in order to avoid any information
loss, while reducing the number of non-zero elements in the resulting
factors. Furthermore, we do not restrict the new space to have a
small number of terms.

Figure 2: Matrix representation for combining two terms

For example, the top matrix in Figure 2 is factored into two
matrices such that: (1) the product of the factors is equal to the
input matrix, and (2) the factor matrices contain fewer non-zeros than
the input matrix. Note that the rank of the second factor (five) is
higher than the rank of the input matrix (two). To answer user
queries, we use the first factor to map (i.e., rewrite) query terms
from the original term space $\Omega$ to a new space $\Theta$ of
meta-terms (e.g., $T_1$ is mapped to $\tau_1, \tau_2, \tau_3, \tau_4$
in Figure 2). The rewritten query is used for searching the second
factor, which represents the compressed inverted index, using any
top-k search algorithm.

In this paper, we prove that finding the optimal (i.e., the sparsest) 
factorization is NP-hard. We develop a greedy algorithm that effi- 
ciently finds an approximate solution. The core of the algorithm is: 
(1) efficiently identifying segments of various posting lists that are 
identical up to a multiplicative factor, (2) extracting a copy of such 
common segments representing new meta-terms, and (3) removing 
the common segments from the original lists. Term mappings are 
constructed such that the original terms are mapped to both the up- 
dated original lists (with the common segment removed) and to the 
newly created meta-terms. 

Although lossless compression helps reduce the amount of memory
required for storing a given index, it usually incurs a computational
overhead due to the need to decompress the index data [22, 26]. Such
overhead should be minimized in order to keep query latency small.
The computational overhead in our approach is due to rewriting a
query using a number of meta-terms that is greater than the number of
the original terms. For example, the number of meta-terms in Figure 2
is five, while the original number of terms is two. We show how to
eliminate this negative effect by exploiting some unique
characteristics of our compression technique. More specifically, we
show that standard query processing approaches such as the
No-Random-Access (NRA) algorithm [10] can be modified to search the
compressed index as fast as the original algorithm searches the
uncompressed index.

We evaluate the proposed compression techniques on the TREC WT10g
dataset [1]. The experiments show that our compression algorithm
reduces the index size by up to 35%. Furthermore, integrating our
approach with a standard lossless compression technique, namely
variable-byte encoding [19], pushes the space savings to 50%. We show
that moderate compression (e.g., 20%) incurs no overhead on query
evaluation performance, while higher compression ratios incur a
negligible overhead.

In summary, the contributions of the paper are as follows. 

• We propose a novel approach to lossless compression of
inverted indices that is based on exact matrix factorization. We
prove that obtaining the optimal factorization is NP-hard.

• We propose a greedy algorithm for exact matrix factoriza- 
tion, and show how to parallelize our algorithm using the 
MapReduce paradigm. We show that it is still possible to in- 
crementally update compressed indices with a minimal cost. 

• We demonstrate how to eliminate query evaluation overhead 
due to decompression by exploiting characteristics of the 
compressed index. 

• We experimentally evaluate our techniques on the standard
TREC WT10g dataset.

The remainder of the paper is organized as follows. In Section 2,
we introduce basic concepts and notation used throughout the paper.
In Section 3, we establish the link between index compression and
matrix factorization. Section 4 describes the proposed factorization
algorithm, and how to update a compressed index to accommodate new
documents. In Section 5, we show how to modify search algorithms to
reduce query response time. The experimental evaluation is presented
in Section 6. We discuss the related work in Section 7. Finally, we
conclude the paper in Section 8.

2. PRELIMINARIES 

In this paper, we use the vector-space representation of documents
and queries. That is, documents and queries are represented as
vectors in a multidimensional space where each term is a dimension.
Let $\Omega = \{T_1, T_2, \ldots, T_m\}$ be a set of terms. A document
$D_j$ is a vector $(d_j^1, d_j^2, \ldots, d_j^m)$. When a term $T_i$
occurs in a document $D_j$, the element $d_j^i$ is non-zero, and its
value is typically related to the number of times $T_i$ occurs in the
document. Similarly, a query Q is represented by a vector
$(w_1, w_2, \ldots, w_m)$, where non-zero elements correspond to terms
appearing in the query, and their values are term weights in the
query.

Given a document corpus of n documents
$Docs = \{D_1, D_2, \ldots, D_n\}$ and a query Q, a common task in many
information retrieval systems is to retrieve the k documents with the
highest score according to some scoring function Score(D, Q). In
this work, we assume the scoring function is defined as the inner
product of document and query vectors. That is,

$$Score(D_j, Q) = \sum_{i=1}^{m} w_i \, d_j^i \qquad (1)$$

Many information retrieval systems use inverted indices as their
main data structure for top-k retrieval. An inverted index is a
collection of posting lists $L_1, L_2, \ldots, L_m$: a list for each
term in $\Omega$. List $L_i$ is a vector containing the weights of term
$T_i$ in all documents (i.e., $L_i = (d_1^i, d_2^i, \ldots, d_n^i)$).
For typical document collections, the majority of the values in
posting lists are zeros. Thus, inverted indices use a sparse
representation, where zero entries are omitted. Specifically, each
posting list $L_i$ contains postings of the form (docID, payload),
where docID is the identifier of document $D_j$, and the payload
contains the (non-zero) value $d_j^i$.

Given a query Q, top-k search algorithms use posting lists of
terms that have non-zero weights in Q to obtain the top-k documents.
A naive algorithm would examine all entries in the relevant posting
lists, compute the scores of found documents, and return the top-k
documents. However, the total number of documents in the relevant
posting lists is typically much larger than k, especially when Q
contains frequent terms. Many top-k search algorithms (e.g., [4, 10])
aim at retrieving the top-k documents while examining only a fraction
of entries in the relevant posting lists.
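
The naive algorithm described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the index layout (a dictionary from term to a list of (docID, payload) pairs), the function name `naive_top_k`, the toy data, and the tie-breaking rule are all hypothetical and are not taken from the paper.

```python
import heapq
from collections import defaultdict


def naive_top_k(index, query, k):
    """Score every document found in the relevant posting lists, keep the k best.

    index: term -> list of (doc_id, payload) postings (non-zero entries only)
    query: term -> weight (non-zero entries only)
    """
    scores = defaultdict(float)
    for term, weight in query.items():
        for doc_id, payload in index.get(term, []):
            scores[doc_id] += weight * payload  # inner product, one posting at a time
    # Highest score first; ties broken by document identifier (an arbitrary choice).
    return heapq.nsmallest(k, scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy index with three terms and three documents.
toy_index = {
    "T1": [("D1", 2.0), ("D3", 1.0)],
    "T2": [("D2", 3.0), ("D3", 1.0)],
    "T3": [("D1", 1.0)],
}
toy_query = {"T1": 1.0, "T2": 2.0}
top_two = naive_top_k(toy_index, toy_query, 2)
print(top_two)  # [('D2', 6.0), ('D3', 3.0)]
```

The sketch touches every posting of every query term, which is exactly the cost that early-termination algorithms try to avoid.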

REMARK In this work, we assume that the payload does not con- 
tain additional information such as term position, typesetting, etc. 
Primitive payloads (i.e., consisting of only term frequencies) can 
be found in many large-scale applications such as computational 
advertising, where documents are relatively short, and extra
information such as term positions is not informative. Moreover, some
applications employ a two-tier retrieval. That is, candidate results 
are first obtained from an inverted index consisting of primitive in- 
formation only. In the second phase, candidate results are re-scored 
using a more accurate function that considers additional informa- 
tion. The additional information is typically fetched from a forward 
index, and does not need to be stored in the inverted index. 

3. INDEX COMPRESSION AS MATRIX 
FACTORIZATION 

We represent an index by an $m \times n$ term-document matrix V,
where rows are posting lists and columns are document vectors
(Figure 1). We denote by V[T, D] the value of the element in V
corresponding to a row T and a column D. We use $\|V\|_0$ to denote
the number of non-zero elements in V. Top-k retrieval corresponds to
computing a score vector S that is equal to the product of a query
vector Q and the matrix V, and picking the top-k documents with the
highest scores. Formally, $S^T = Q^T V$, where the superscript T
denotes the transpose operator.

Typical document collections, such as Web corpora, contain
redundant elements in their term-document matrices due to duplicate
or near-duplicate contents. For example, news articles are usually
shared across multiple Web sites. In this case, two documents $D_x$
and $D_y$ that refer to the same article would contain several
identical sentences consisting of terms $T_a, T_b, \ldots, T_p$.
Consequently, the term-document matrix would contain two identical
sets of values: $V[T_a, D_x], \ldots, V[T_p, D_x]$ and
$V[T_a, D_y], \ldots, V[T_p, D_y]$. Another example that leads to
redundancy in the term-document matrix is the co-occurrence of subsets
of terms in multiple documents. For example, the terms "Britney" and
"Spears" usually co-occur in documents related to music. In this case,
the term-document matrix will contain two identical sets of values:
$V[T_x, D_a], \ldots, V[T_x, D_p]$ and
$V[T_y, D_a], \ldots, V[T_y, D_p]$, where $T_x$ and $T_y$ are
co-occurring terms in a set of documents $\{D_a, \ldots, D_p\}$.

A known technique for reducing redundancy in a matrix is matrix
factorization. The simplest form of factorization is decomposing V
into two matrices: an $m \times r$ matrix W and an $r \times n$ matrix
H, such that V = WH. Note that since our goal is lossless index
compression, we consider the exact formulation and not the
approximate one ($V \approx WH$). In our case, the objective is to
minimize the total number of non-zero elements in W and H
(i.e., $\|W\|_0 + \|H\|_0$).



Intuitively, factoring V into WH transforms the set of terms into
another space, denoted $\Theta$, consisting of r meta-terms
$\{\tau_1, \tau_2, \ldots, \tau_r\}$. Matrix W linearly maps terms in
$\Omega$ to meta-terms in $\Theta$ (and vice versa), while matrix H
represents the inverted index of Docs in the space of meta-terms.
Figure 2 is an illustration of such a factorization, where terms
$\{T_1, T_2\}$ are linearly mapped into meta-terms
$\{\tau_1, \ldots, \tau_5\}$ using matrix W, and documents are
represented as combinations of these meta-terms in matrix H. Note
that although r > m (i.e., the number of rows in H is greater than
the number of rows in V), the number of non-zeros in W and H is
less than the number of non-zeros in V.

Evaluation of query Q is performed on the inverted index
represented by H, after rewriting Q according to W. Specifically, we
rewrite the query vector Q into a vector Q' such that
$Q'^T = Q^T W$. In other words, each term T with non-zero weight in Q
is replaced by a set of meta-terms $\{\tau : W[T, \tau] \neq 0\}$. The
weight of each meta-term $\tau$ in Q' is $w \cdot W[T, \tau]$, where w
is the weight of T in Q. Once Q is rewritten into Q', any standard
search algorithm can be used to retrieve the top-k documents from the
compressed index H using query Q'. The following theorem proves that
searching the original inverted index using Q is equivalent to
searching the index represented by H using Q'.

THEOREM 1. Let W and H be the result of factoring V (i.e.,
WH = V). Let A(V, Q, k) be the top-k documents for query Q
using inverted index V and the scoring function in Equation 1 (ties
in scores are broken by some predefined criteria). Let the rewritten
query be Q' such that $Q'^T = Q^T W$. Then,
A(V, Q, k) = A(H, Q', k).

PROOF. Let $Score(D_j, Q, V)$ denote the score of a document $D_j$,
given a query Q and an inverted index represented by matrix V. Since
the top-k results are selected based on document scores computed
using Equation 1, we only need to show that
$Score(D_j, Q, V) = Score(D_j, Q', H)$ for all $D_j$. The value of
$Score(D_j, Q, V)$ can be rewritten as the dot product
$Q \cdot V[:, D_j]$, where $V[:, D_j]$ denotes the vector
corresponding to the column $D_j$ in V. Then,

$$Score(D_j, Q, V) = Q \cdot V[:, D_j] = Q^T V[:, D_j]$$
$$= Q^T W H[:, D_j] = Q'^T H[:, D_j]$$
$$= Q' \cdot H[:, D_j] = Score(D_j, Q', H).$$

□

An immediate consequence of Theorem 1 is that standard top-k
algorithms can still be used for searching the compressed indices
without any loss in precision or recall.
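
The equivalence stated by Theorem 1 can be checked numerically on a small example. The sketch below is illustrative only: the matrices V, W, and H, the query, and all helper names are hypothetical toy values chosen so that WH = V holds exactly; they are not the matrices of Figure 2.

```python
def matmul(a, b):
    """Dense product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def nnz(matrix):
    """Number of non-zero elements, i.e. the ||.||_0 measure used in the text."""
    return sum(1 for row in matrix for x in row if x != 0)


def vec_times_matrix(q, matrix):
    """Row vector times matrix: computes q^T M as a list."""
    return [sum(q[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(len(matrix[0]))]


def top_k(score_vector, k):
    """Column indices of the k highest scores; ties broken by smaller index."""
    return sorted(range(len(score_vector)), key=lambda j: (-score_vector[j], j))[:k]


# Toy index: 2 terms x 6 documents. The first five columns of the two rows
# are proportional (ratio 2), so they can share one meta-term.
V = [[2, 4, 6, 8, 10, 1],
     [1, 2, 3, 4, 5, 7]]
W = [[2, 1, 0],            # T1 = 2 * tau1 + tau2
     [1, 0, 1]]            # T2 = 1 * tau1 + tau3
H = [[1, 2, 3, 4, 5, 0],   # tau1: common segment
     [0, 0, 0, 0, 0, 1],   # tau2: remainder of T1
     [0, 0, 0, 0, 0, 7]]   # tau3: remainder of T2

Q = [1, 3]
Q_rewritten = vec_times_matrix(Q, W)          # Q'^T = Q^T W
scores_original = vec_times_matrix(Q, V)      # S^T = Q^T V
scores_compressed = vec_times_matrix(Q_rewritten, H)

print(matmul(W, H) == V)                      # True: the factorization is exact
print(nnz(V), nnz(W) + nnz(H))                # 12 11
print(scores_original == scores_compressed)   # True
print(top_k(scores_compressed, 2))            # [4, 5]
```

In this toy case the factorization uses three meta-terms for two original terms, yet stores one fewer non-zero value, and both indices produce identical score vectors.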

Typically, inverted indices contain non-negative term-document
weights, e.g., reflecting term frequencies. The non-negativity of
weights is exploited in some top-k retrieval algorithms such as
WAND [4]. In order to be able to use such algorithms over compressed
indices, we aim at preserving the non-negativity of V in
the factor matrices W and H.

REMARK It is possible to interpret the intermediate space $\Theta$
as a space of meta-documents, rather than meta-terms. In this case,
the matrix W represents an inverted index of meta-documents in the
original term space $\Omega$, and H is a mapping from meta-documents
to documents in Docs. However, under such an interpretation, existing
top-k retrieval algorithms that employ early termination cannot be
used on W, since the top-k meta-documents do not necessarily contain
the top-k documents. For example, in Figure 2 suppose that the
entities $\tau_1, \ldots, \tau_5$ represent a set of meta-documents,
and suppose
we want to compute the top-1 document for query Q = (1, 0).
Applying the top-k algorithm to W returns $\tau_3$, which is then
mapped to a document set that does not include the correct top-1
result.

4. SPARSE MATRIX FACTORIZATION 

In this section, we consider the following exact sparse matrix
factorization problem. Given a matrix V, obtain W and H that

$$\text{minimize } \|W\|_0 + \|H\|_0 \quad \text{subject to } WH = V$$

Unlike typical factorization problems, we do not impose any
restriction on the dimensionality r of the intermediate space (i.e.,
$|\Theta|$). In particular, it is allowed to be higher than
max(n, m). Requiring r to be much lower than the original dimensions
n and m prohibits sparse and exact solutions, which are required in
our problem. Thus, existing factorization techniques are not
appropriate (see more details in Section 7), and we resort to
developing a new factorization approach.

The following theorem states that the problem of obtaining the 
sparsest exact factorization is NP-hard. 

THEOREM 2. Given a matrix V, the problem of obtaining two
matrices W and H, subject to the constraint WH = V, such that
$\|W\|_0 + \|H\|_0$ is minimum is NP-hard.

PROOF. We prove the claim by reduction from the NP-complete
SPARSESTVECTOR problem [11]: given a full-rank $m \times n$ matrix A
and an $m \times 1$ vector b, find an $n \times 1$ vector x with
minimal $\|x\|_0$ such that $Ax = b$. Given an instance (A, b) of
SPARSESTVECTOR, we construct an instance of the sparse matrix
factorization problem as follows. Let V be a matrix obtained by
concatenating A, $p = n(m + 1) + 1$ times horizontally, followed by
the vector b: $V = [A\ A\ \ldots\ A\ b]$. Note that V is an
$m \times (np + 1)$ matrix.

One solution is to factor V = AB with $B = [I\ I\ \ldots\ I\ x]$,
where I is the $n \times n$ identity matrix, and x is the optimum
solution to the SPARSESTVECTOR problem. For this solution,
$\|B\|_0 = np + \|x\|_0$, and the total solution cost is
$\|A\|_0 + \|B\|_0 = \|A\|_0 + np + \|x\|_0 \le n(m + p + 1)$.

Consider any other solution for factoring V = WH such that W is an
$m \times k$ matrix, and H is a $k \times (np + 1)$ matrix. Let
$H = [H_1\ H_2\ \ldots\ H_p\ y]$. In any optimal solution,
$\|H_1\|_0 = \|H_2\|_0 = \cdots = \|H_p\|_0$. Let us assume otherwise,
i.e., there are indices i and j with $\|H_i\|_0 < \|H_j\|_0$. Let
$H' = [H_1\ \ldots\ H_{j-1}\ H_i\ H_{j+1}\ \ldots\ H_p\ y]$; then
$V = WH'$ and $\|H'\|_0 < \|H\|_0$, a contradiction with the
optimality of (W, H).

Let $q = \|H_i\|_0$, for $i \in \{1, \ldots, p\}$. Assume that
$q \ge n + 1$; then the cost of the solution is at least

$$qp \ge (n + 1)p = (n + 1)(n(m + 1) + 1) = n^2 m + nm + n^2 + 2n + 1.$$

But the cost of the (A, B) solution is no more than

$$n(m + p + 1) = n(m + n(m + 1) + 2) = n(m + nm + n + 2) = nm + n^2 m + n^2 + 2n,$$

which is strictly smaller. Therefore, the considered solution is not
optimal, which is a contradiction. Thus, for any optimal solution,
$q < n + 1$.

Observe that since A is full rank, $q \ge n$. Therefore, in any
optimal solution q = n and $H_i$ is a permutation of the identity
matrix, for $i \in \{1, \ldots, p\}$. Therefore, y is a permutation of
the solution to the sparsest vector problem.

□

Since obtaining the optimal factorization is computationally
infeasible, we propose an iterative greedy algorithm for getting an
exact factorization that might not have the minimal number of
non-zeros. The key idea is to start with a trivial factorization
$W_0 = I_m$ ($I_m$ is the identity matrix of rank m) and $H_0 = V$,
and iteratively improve the current solution $(W_t, H_t)$ by a
sequence of local transformations on $W_t$ and $H_t$, obtaining
$W_{t+1}$ and $H_{t+1}$. Each step is guaranteed to reduce
$\|W_t\|_0 + \|H_t\|_0$ while preserving two invariants:
(1) $W_t H_t = V$, and
(2) $\|W_{t+1}\|_0 + \|H_{t+1}\|_0 < \|W_t\|_0 + \|H_t\|_0$. Although
our iterative algorithm does not necessarily reach an optimal
solution, it achieves significant compression ratios after a few
iterations (see Section 6).

Figure 3: Combining term vectors: (a) matrices $W_t$ and $H_t$;
(b) matrices $W_{t+1}$ and $H_{t+1}$ after combining $\tau_1$ and
$\tau_2$

The transformation performed at each step is based on the
observation that correlated terms and documents induce correlated
values in columns and rows of V. At step t, given matrices $W_t$ and
$H_t$, the algorithm looks for a submatrix $H'_t$ of $H_t$, defined by
a subset R of the rows of $H_t$ and a subset C of the columns of
$H_t$, such that the rank of the submatrix is one. That is, all rows
of $H'_t$ are multiples of each other:

$$\forall (\tau_i, \tau_j) \in R \times R,\ \forall (D_p, D_q) \in C \times C:\quad \frac{H_t[\tau_i, D_p]}{H_t[\tau_j, D_p]} = \frac{H_t[\tau_i, D_q]}{H_t[\tau_j, D_q]} \qquad (2)$$

Clearly, keeping only one representative row from $H'_t$ and
encoding the other rows in $H'_t$ as multiples of the representative
row would reduce the number of non-zero values in $H_t$.
Unfortunately, identifying the largest submatrix $H'_t$ is equivalent
to the problem of finding the largest bi-cluster [18], which is an
NP-hard problem. Thus, our algorithm considers only submatrices
consisting of two rows (i.e., |R| = 2) at each step.

For efficiency, our algorithm identifies z rank-1 submatrices at
each iteration that are composed of two rows
$R = \{\tau_i, \tau_j\}$ and z sets of columns $C_1, \ldots, C_z$
(i.e., Equation 2 holds for submatrices $(R, C_1)$ through
$(R, C_z)$). Then, rows $\tau_i$ and $\tau_j$ can be rewritten as
linear combinations of a set of z common subvectors, denoted
$\tau_{r+1}, \ldots, \tau_{r+z}$, and two remainder vectors
$\tau_{r+z+1}$ and $\tau_{r+z+2}$ that contain the values of documents
that are not in $C_1 \cup \cdots \cup C_z$. More specifically,

$$\tau_i = \alpha_1 \tau_{r+1} + \alpha_2 \tau_{r+2} + \cdots + \alpha_z \tau_{r+z} + \tau_{r+z+1} \qquad (3)$$
$$\tau_j = \beta_1 \tau_{r+1} + \beta_2 \tau_{r+2} + \cdots + \beta_z \tau_{r+z} + \tau_{r+z+2} \qquad (4)$$

Vectors $\tau_{r+1}, \ldots, \tau_{r+z+2}$ are appended to matrix
$H_t$, and vectors $\tau_i$ and $\tau_j$ are removed from $H_t$,
resulting in matrix $H_{t+1}$. Matrix $W_t$ is modified to map
original terms to $\tau_{r+1}, \ldots, \tau_{r+z+2}$ instead of
$\tau_i, \tau_j$, resulting in matrix $W_{t+1}$. Algorithm 1 describes
the procedure in more detail. Function
GetCorrelatedSubmatrices($H_t$), which we describe in Section 4.1,
is responsible for extracting the sets R and $C_1, \ldots, C_z$
that maximize space saving.

Figure 3 shows an example of combining two meta-terms $\tau_1$
and $\tau_2$ into common subvectors $\tau_3, \tau_4, \tau_5$, and
remainder vectors $\tau_6, \tau_7$. Without loss of generality, we
assume hereafter that $\beta_1 = \cdots = \beta_z = 1$. We ensure that
$\alpha_p \neq \alpha_q$ for $p \neq q$ (otherwise, we combine
$\tau_{r+p}$ and $\tau_{r+q}$ into one subvector). An important
consequence is that $C_1, \ldots, C_z$ are pairwise disjoint. We rely
on this property to improve query evaluation performance (Section 5).



Algorithm 1 ComputeFactorization(V)

 1: $W_0 \leftarrow I_m$; $H_0 \leftarrow V$
 2: $r \leftarrow m$
 3: $t \leftarrow 0$
 4: repeat
 5:     $H_{t+1} \leftarrow H_t$
 6:     $(R, C_1, \ldots, C_z) \leftarrow$ GetCorrelatedSubmatrices($H_t$)
 7:     if failed to find correlated submatrix then
 8:         break
 9:     remove rows $R = \{\tau_i, \tau_j\}$ from $H_{t+1}$, and add the new
        rows $\tau_{r+1}, \ldots, \tau_{r+z+2}$ to $H_{t+1}$
10:     construct an $r \times (r + z + 2)$ transformation matrix $W_M$ that
        linearly maps $\tau_i$ and $\tau_j$ to $\tau_{r+1}, \ldots, \tau_{r+z+2}$
        using Equations 3 and 4, and trivially maps all other meta-terms
        in $W_t$ to themselves
11:     $W_{t+1} \leftarrow W_t W_M$
12:     $r \leftarrow r + z + 2$
13:     $t \leftarrow t + 1$
14: until $\|W_t\|_0 + \|H_t\|_0$ converges
15: return $W_t, H_t$
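
A single combination step (the row replacement and the update of W in the loop body of Algorithm 1) can be sketched as follows. The sketch assumes the column sets and their ratios are already given, uses the convention that all beta coefficients equal 1, and stores both matrices sparsely as nested dictionaries. The function names, the data layout, and the choice to drop empty remainder rows are assumptions made for illustration, not details taken from the paper.

```python
def combine_rows(H, W, i, j, groups, r):
    """Combine rows i and j of H; return (H_next, W_next, r_next).

    H: meta-term id -> {doc: value}          (non-zero entries only)
    W: term -> {meta-term id: coefficient}   (non-zero entries only)
    groups: list of (alpha_p, C_p) with H[i][d] == alpha_p * H[j][d] for d in C_p
    r: current number of meta-term ids; new rows get ids r+1 .. r+z+2
    """
    z = len(groups)
    H_next = {row: dict(vals) for row, vals in H.items() if row not in (i, j)}
    covered = set()
    for p, (alpha, cols) in enumerate(groups, start=1):
        H_next[r + p] = {d: H[j][d] for d in cols}  # common subvector (beta_p = 1)
        covered |= set(cols)
    rem_i = {d: v for d, v in H[i].items() if d not in covered}
    rem_j = {d: v for d, v in H[j].items() if d not in covered}
    if rem_i:  # assumption: an empty remainder row is simply not stored
        H_next[r + z + 1] = rem_i
    if rem_j:
        H_next[r + z + 2] = rem_j

    W_next = {}
    for term, coeffs in W.items():
        a, b = coeffs.get(i, 0), coeffs.get(j, 0)
        new = {row: c for row, c in coeffs.items() if row not in (i, j)}
        for p, (alpha, _) in enumerate(groups, start=1):
            c = a * alpha + b  # coefficient of the p-th common subvector
            if c != 0:
                new[r + p] = c
        if a != 0 and rem_i:
            new[r + z + 1] = a
        if b != 0 and rem_j:
            new[r + z + 2] = b
        W_next[term] = new
    return H_next, W_next, r + z + 2


def product(W, H):
    """Sparse product WH as term -> {doc: value}, with zeros dropped."""
    V = {}
    for term, coeffs in W.items():
        row = {}
        for meta, c in coeffs.items():
            for d, v in H.get(meta, {}).items():
                row[d] = row.get(d, 0) + c * v
        V[term] = {d: v for d, v in row.items() if v != 0}
    return V


def nnz(M):
    return sum(len(row) for row in M.values())


# Hypothetical starting point: W_0 is the identity and H_0 = V.
W0 = {"T1": {1: 1}, "T2": {2: 1}}
H0 = {1: {"D1": 2, "D2": 4, "D3": 6, "D4": 3, "D5": 9, "D6": 5},
      2: {"D1": 1, "D2": 2, "D3": 3, "D4": 1, "D5": 3, "D7": 4}}
groups = [(2, {"D1", "D2", "D3"}), (3, {"D4", "D5"})]  # ratios 2 and 3

H1, W1, r1 = combine_rows(H0, W0, 1, 2, groups, r=2)
print(product(W1, H1) == product(W0, H0))       # True: invariant (1) holds
print(nnz(W0) + nnz(H0), nnz(W1) + nnz(H1))     # 14 13
```

On this toy input the step removes five duplicated values from H and adds four coefficients to W, for a net saving of one stored value.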



4.1 Identifying Correlated Submatrices 

The goal of function GetCorrelatedSubmatrices($H_t$) is to
return correlated submatrices, defined by $R = \{\tau_i, \tau_j\}$ and
$C_1, \ldots, C_z$. Our algorithm heuristically finds the submatrices
that would result in the highest reduction of space. In the
following, we describe how to find the sets $C_1, \ldots, C_z$ given
R, formulate the potential saving from combining two given
meta-terms, and finally describe how to find R.

First, we show how to compute $C_1, \ldots, C_z$, given the two
meta-terms $\tau_i$ and $\tau_j$ to combine. Denote by $\tau[p]$ the
value of the element at index p in a row $\tau$ in $H_t$. For two rows
$\tau_i$ and $\tau_j$ in $H_t$, we compute a vector $\gamma$ of length
n as follows:

$$\gamma[q] = \begin{cases} \tau_i[q] / \tau_j[q] & \text{if } \tau_j[q] \neq 0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{for } 1 \le q \le n.$$

Each set $C_p$ is a subset of documents (columns in $H_t$) that have
the same non-zero value in $\gamma$. For example, in Figure 3 the
first four cells have the same value in $\gamma$, namely 2/3, and thus
constitute a common subvector ($\tau_3$).
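
The grouping of columns by ratio can be sketched as below. The sketch assumes rows are stored sparsely as dictionaries with integer payloads (so that exact rational ratios can be used instead of floating-point values); the function names and the example rows are hypothetical and do not reproduce the contents of Figure 3.

```python
from collections import defaultdict
from fractions import Fraction


def ratio_vector(row_i, row_j):
    """Sparse version of the ratio vector: only documents where both rows
    are non-zero are kept; all other positions are implicitly zero."""
    return {d: Fraction(row_i[d], row_j[d]) for d in row_i if d in row_j}


def column_groups(row_i, row_j):
    """Map each distinct non-zero ratio to the set of documents sharing it."""
    groups = defaultdict(set)
    for doc, ratio in ratio_vector(row_i, row_j).items():
        groups[ratio].add(doc)
    return dict(groups)


# Hypothetical rows (doc id -> integer payload, zeros omitted).
row_i = {1: 2, 2: 4, 3: 6, 4: 8, 5: 5, 6: 1}
row_j = {1: 3, 2: 6, 3: 9, 4: 12, 5: 10, 7: 2}

groups = column_groups(row_i, row_j)
for ratio, docs in sorted(groups.items()):
    print(ratio, sorted(docs))
# 1/2 [5]
# 2/3 [1, 2, 3, 4]
```

Because each document has exactly one ratio, the resulting column sets are pairwise disjoint by construction.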

The space saving resulting from combining $\tau_i$ and $\tau_j$ is
computed as follows. Combining $\tau_i$ and $\tau_j$ in Algorithm 1
reduces the number of non-zero elements in $H_t$ by
$\sum_{p=1}^{z} |C_p|$ because each subvector $\tau_{r+p}$
corresponding to $C_p$ is stored twice in $H_t$ and only once in
$H_{t+1}$. On the other hand, combining $\tau_i$ and $\tau_j$ means
that all the terms in $W_t$ that were mapped to either $\tau_i$ or
$\tau_j$ are now mapped to z additional meta-terms (i.e.,
$\tau_{r+1}, \ldots, \tau_{r+z}$) in $W_{t+1}$, which increases the
number of non-zero elements in $W_{t+1}$. For example, in Figure 3,
$T_1$ is originally mapped to $\tau_1$. After combining $\tau_1$ and
$\tau_2$, $T_1$ is mapped to three extra meta-terms, namely
$\tau_3, \tau_4, \tau_5$ (besides the remainder meta-term $\tau_6$),
which results in three additional elements in W. Formally, the
overall space saving when combining $\tau_i$ and $\tau_j$ is:

$$saving(\tau_i, \tau_j, C_1, \ldots, C_z, W_t) = \sum_{p=1}^{z} |C_p| - z \cdot |\{T \in \Omega : W_t[T, \tau_i] \neq 0 \lor W_t[T, \tau_j] \neq 0\}| \qquad (5)$$
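
The saving formula can be evaluated directly once the column sets are known. The following sketch assumes W is stored as a dictionary from term to its non-zero meta-term coefficients; the function name, the meta-term labels, and the toy values are hypothetical.

```python
def saving(groups, W, i, j):
    """Evaluate the saving formula for combining meta-terms i and j.

    groups: the column sets C_1, ..., C_z
    W: term -> {meta-term: coefficient}
    """
    groups = list(groups)
    z = len(groups)
    # Number of original terms currently mapped to i or to j.
    affected = sum(1 for coeffs in W.values()
                   if coeffs.get(i, 0) != 0 or coeffs.get(j, 0) != 0)
    return sum(len(c) for c in groups) - z * affected


# Three terms touch the pair (t1, t2); T4 is unaffected.
W_toy = {
    "T1": {"t1": 1},
    "T2": {"t2": 1},
    "T3": {"t2": 2, "t9": 1},
    "T4": {"t9": 1},
}
long_groups = [{1, 2, 3, 4}, {5, 6, 7}]
print(saving(long_groups, W_toy, "t1", "t2"))          # 7 - 2*3 = 1
print(saving(long_groups + [{8}], W_toy, "t1", "t2"))  # 8 - 3*3 = -1
```

The second call shows that adding a very short common subvector can turn a positive saving into a negative one, since each extra subvector costs one coefficient per affected term.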

In the following, we describe how to efficiently identify a pair
of rows in $H_t$, denoted by R, with the highest potential savings.
A straightforward approach is to compute the potential space saving,
based on Equation 5, for all pairs of rows and return the pair
with the highest savings. Unfortunately, the complexity of such an
approach is quadratic in the number of rows in $H_t$, which is
prohibitively expensive. To reduce the number of pairwise
comparisons, we use a blocking technique to prune a large number of pairs
that have low potential space saving. Blocking techniques have 
frequently been used in clustering algorithms that rely on pairwise 
comparison (e.g., [20]). The main goal of a blocking technique is 
to partition the set of objects into multiple blocks such that "sim- 
ilar" objects are placed in the same block. Thus, we only need to 
compare pairs of objects that belong to the same block. 

Recall that our distance metric for comparing two rows $\tau_i$ and
$\tau_j$ is the potential reduction in space resulting from combining
them. Let $\tau_i \cap \tau_j$ be the set of documents that have
non-zero values in both $\tau_i$ and $\tau_j$ in $H_t$. The maximum
possible savings can be obtained when all elements in
$\tau_i \cap \tau_j$ are placed in the same common subvector.
Therefore, we use the overlap between rows (i.e.,
$|\tau_i \cap \tau_j|$) as an upper bound of the potential savings. We
thus place the rows with high overlap in the same block.

Since computation of overlap is expensive because of the large
vector lengths, we approximate it using sketching. We partition
documents $\{D_1, \ldots, D_n\}$ into $\lambda$ disjoint groups,
denoted $G_1, \ldots, G_\lambda$, by assigning each document to a
randomly selected group. For each row $\tau_i$, we compute a
$\lambda$-dimensional vector $S_i$ such that $S_i[p]$,
$1 \le p \le \lambda$, is equal to the number of documents in $G_p$
that are associated with $\tau_i$ in $H_t$. Vector $S_i$ is the sketch
of $\tau_i$. The blocking algorithm picks the dimension
$dim \in \{1, \ldots, \lambda\}$ with the largest variance across
sketches $S_1, \ldots, S_r$. Dimension dim is used for splitting the
rows into two blocks such that the first (respectively, second) block
contains rows with value of dimension dim below (respectively, above)
the median. The algorithm recursively applies the same process until
block sizes are smaller than a predefined threshold B. We
experimentally analyze the effect of parameters $\lambda$ and B in
Section 6. Once sufficiently small blocks are identified, we find a
pair of rows that maximizes space savings by brute-force computation
in each block.
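
The sketch-based blocking procedure can be illustrated with the following minimal Python sketch. Several details are assumptions made for the example rather than choices stated in the text: rows whose value equals the median go to the second block, recursion stops when a split would leave one side empty, and the names `random_groups`, `sketch`, and `make_blocks` are hypothetical.

```python
import random
from statistics import median, pvariance


def random_groups(docs, lam, seed=0):
    """Assign every document to one of lam groups uniformly at random."""
    rng = random.Random(seed)
    return {d: rng.randrange(lam) for d in docs}


def sketch(row, group_of, lam):
    """S[p] = number of documents of group p that appear in the row."""
    s = [0] * lam
    for doc in row:
        s[group_of[doc]] += 1
    return s


def make_blocks(sketches, threshold):
    """Recursively split rows on the sketch dimension with the largest variance.

    sketches: row id -> sketch vector; returns a list of blocks (lists of row ids).
    """
    ids = list(sketches)
    if len(ids) < threshold:
        return [ids]
    lam = len(sketches[ids[0]])
    dim = max(range(lam), key=lambda p: pvariance([sketches[i][p] for i in ids]))
    med = median(sketches[i][dim] for i in ids)
    low = [i for i in ids if sketches[i][dim] < med]
    high = [i for i in ids if sketches[i][dim] >= med]  # assumption: ties go here
    if not low or not high:  # no further split is possible on this dimension
        return [ids]
    return (make_blocks({i: sketches[i] for i in low}, threshold)
            + make_blocks({i: sketches[i] for i in high}, threshold))


# Deterministic toy grouping (even/odd doc ids) so the result is reproducible;
# random_groups(range(8), 2) would be used for a randomized assignment instead.
group_of = {d: d % 2 for d in range(8)}
rows = {"a": {0, 2, 4, 6}, "b": {0, 2, 4}, "c": {1, 3, 5, 7}, "d": {1, 3, 5}}
sketches = {name: sketch(row, group_of, 2) for name, row in rows.items()}
blocks = make_blocks(sketches, threshold=3)
print(sorted(sorted(b) for b in blocks))  # [['a', 'b'], ['c', 'd']]
```

In the toy input, rows a and b overlap heavily, as do c and d, and the split places each overlapping pair in the same block, so only two pairs remain for the brute-force comparison instead of six.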

4.2 Limiting the Number of Subvectors 

Recall that our compression approach iteratively reduces the
index size at the cost of increasing the number of meta-terms.
That is, there is a trade-off between the space savings and the
increase in the number of meta-terms. In particular, it may not
be worthwhile to introduce new meta-terms whose space savings are
below some threshold. We note that the length of a new meta-term
$\tau_{r+p}$ represents its maximum potential saving, according to
Equation 5. Therefore, we modify algorithm
GetCorrelatedSubmatrices($H_t$) such that it generates a new
meta-term only if its length is greater than or equal to a threshold
$\mu$. Consider the example depicted in Figure 3 and let $\mu = 3$.
Then, only the first two meta-terms $\tau_3, \tau_4$ would be
generated, while the third meta-term $\tau_5$ would not be generated
(i.e., it becomes part of the remainder meta-terms
$\tau_6, \tau_7$). In Section 6, we experimentally analyze the effect
of $\mu$ on the number of meta-terms and the compression ratio.

4.3 MapReduce Implementation 

Even moderate inverted indices consist of millions of documents
and terms, making a sequential implementation of our iterative
algorithm impractical. To scale the algorithm to large matrices, we
parallelize it according to the MapReduce model [15].

In each iteration of Algorithm 1, we combine two rows of matrix
$H_t$, generating z + 2 new rows in matrix $H_{t+1}$ and updating
values in matrix $W_{t+1}$. These operations are performed
independently of other rows in $H_t$. Moreover, the rows and values
written to $H_{t+1}$ and $W_{t+1}$ depend only on the rows being
combined. These observations allow the following parallelization
scheme: (1) identify several disjoint pairs of rows, (2) combine all
pairs in parallel and emit the new meta-terms and their coefficients,
and (3) construct $H_{t+1}$ and $W_{t+1}$. Algorithm 2 describes the
details of the procedure.

Since it is impractical to run the algorithm until full convergence,
we use a parameter $\delta$ to control the number of iterations. Once the
space savings resulting from the current iteration fall below $\delta$, the
algorithm stops and returns the current matrices $W_t$ and $H_t$.
Function GetCorrelatedRowsMR($H_t$) obtains a set of independent
row pairs $\{(\tau_i, \tau_j), \ldots\}$ in $H_t$ to combine, such that the overall
savings is maximized. The function computes sketches of the rows
in $H_t$ in parallel (map phase), and then partitions the rows into
blocks according to their sketches (reduce phase). In each block, the
potential space savings from combining each pair of rows is
computed, and independent (i.e., disjoint) pairs with the highest savings
are selected using a technique from [12].
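To make the notion of "independent pairs with the highest savings" concrete, the following Python sketch uses a simple greedy heuristic. This is a stand-in chosen for illustration: it is not the technique from [12], it does not guarantee maximum overall savings, and the function name and the `savings` dictionary layout are assumptions.

```python
def select_disjoint_pairs(savings):
    """Greedily pick row pairs with the highest savings so that no row is used twice.

    savings: dict mapping (row_i, row_j) -> estimated space savings (hypothetical input).
    """
    used = set()
    chosen = []
    # Visit candidate pairs from the largest to the smallest estimated savings.
    for (i, j), s in sorted(savings.items(), key=lambda kv: kv[1], reverse=True):
        if s <= 0 or i in used or j in used:
            continue  # no benefit, or one of the rows is already part of a chosen pair
        chosen.append((i, j))
        used.update((i, j))
    return chosen


# Illustrative savings for four rows a, b, c, d.
example = {("a", "b"): 10, ("b", "c"): 8, ("c", "d"): 7, ("a", "d"): 1}
pairs = select_disjoint_pairs(example)
```

Because every selected pair touches distinct rows, the pairs can be combined in parallel without interfering with one another.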

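Algorithm 2 ComputeFactorizationMR(V)

1: $W_0 \leftarrow I_m$
2: $H_0 \leftarrow V$
3: $r \leftarrow m$
4: $t \leftarrow 0$
5: repeat
6:    $\{(\tau_i, \tau_j), \ldots\} \leftarrow$ GetCorrelatedRowsMR($H_t$)
7:    $(H_{t+1}, W_{t+1}) \leftarrow$ CombineMR($\{(\tau_i, \tau_j), \ldots\}$, $W_t$, $H_t$)
8:    $t \leftarrow t + 1$
9: until $\frac{(\|W_{t-1}\|_0 + \|H_{t-1}\|_0) - (\|W_t\|_0 + \|H_t\|_0)}{\|W_{t-1}\|_0 + \|H_{t-1}\|_0} < \delta$
10: return $W_t$, $H_t$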
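The control flow of Algorithm 2 can be sketched in a few lines of sequential Python. This is only an outline under stated assumptions: matrices are stored as nested dictionaries, the stopping test is the relative drop in non-zeros per iteration, and `get_correlated_rows` and `combine` are hypothetical callables standing in for the MapReduce jobs.

```python
def nnz(matrix):
    """Count non-zero entries of a sparse matrix stored as {row: {col: value}}."""
    return sum(1 for row in matrix.values() for v in row.values() if v != 0)


def factorize(V, get_correlated_rows, combine, delta):
    """Iterate until the relative space savings of one iteration drop below delta."""
    if delta <= 0:
        raise ValueError("delta must be positive, otherwise the loop may never stop")
    W = {t: {t: 1} for t in V}                   # W_0 = I_m
    H = {t: dict(row) for t, row in V.items()}   # H_0 = V
    while True:
        size_before = nnz(W) + nnz(H)
        if size_before == 0:
            return W, H                          # nothing to compress
        pairs = get_correlated_rows(H)           # disjoint row pairs to combine
        H, W = combine(pairs, W, H)              # yields (H_{t+1}, W_{t+1})
        size_after = nnz(W) + nnz(H)
        if (size_before - size_after) / size_before < delta:
            return W, H


# Trivial stand-ins: no pairs are found, so the first iteration saves nothing.
V_example = {"a": {1: 2.0}, "b": {2: 1.0}}
W_out, H_out = factorize(V_example, lambda H: [], lambda pairs, W, H: (H, W), 0.01)
```

With the trivial stand-ins, the loop stops after one iteration and returns the identity mapping together with an unchanged copy of the input.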



Function CombineMR($\{(\tau_i, \tau_j), \ldots\}$, $W_t$, $H_t$) computes
matrix $H_{t+1}$ by combining the pairs of terms obtained
by GetCorrelatedRowsMR($H_t$). The set of row pairs
$\{(\tau_i, \tau_j), \ldots\}$ is cached at each mapper/reducer. A map task
associates each row in $H_t$ with a key referring to the row pair it belongs
to (or to itself if it is not part of any pair). Then, rows that belong to
the same row pair are grouped at the reduce task, and the resulting
rows of $H_{t+1}$ are computed. A term-transformation matrix $W_m$
is computed analogously. Finally, the matrix $W_{t+1}$ is computed as
the product $W_t W_m$ by caching $W_m$ at all mappers/reducers that
process rows of $W_t$ and output the rows of $W_{t+1}$. Based on the
overall space reduction and the parameter $\delta$, the algorithm decides
whether to start a new iteration or to terminate.
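The keying step of the map phase can be mimicked in plain Python. The sketch below only shows how rows are routed to the pair they belong to; the actual row-combination logic in the reducer is omitted, and the function names and data layout are assumptions rather than the Hadoop implementation.

```python
from collections import defaultdict


def map_rows(H, pairs):
    """Emit (key, (row_id, row)) records; the key is the row pair or the row itself."""
    key_of = {}
    for pair in pairs:
        for row_id in pair:
            key_of[row_id] = pair
    for row_id, row in H.items():
        # Rows outside every pair are keyed by a one-element tuple holding their own id.
        yield key_of.get(row_id, (row_id,)), (row_id, row)


def group_by_key(mapped):
    """Simulate the shuffle: collect all records that share a key."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return dict(groups)


H_example = {"a": {1: 1.0}, "b": {1: 2.0}, "c": {2: 1.0}}
groups = group_by_key(map_rows(H_example, [("a", "b")]))
```

Each group would then be handed to a reducer, which combines the two rows of a pair or passes a single unpaired row through unchanged.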

4.4 Updating the Compressed Index 

Document corpora that are extracted from the Web are frequently
updated due to the constant flow of new documents, the removal of
obsolete documents, and the modification of existing documents. The
frequency of such updates requires the ability to update inverted
indices incrementally, without rebuilding the entire index on each
update. In the following, we describe how to incrementally update
factorized indices. We focus on two operations: adding new
documents and removing existing documents. Updating an existing
document can be implemented by removing the old version of
the document and adding the new version to the index.

In regular inverted indices, adding a new document is imple- 
mented by assigning the document a new document identifier and 
inserting a posting into the posting list of each term that appears in 
the document. The insertion position in the posting list depends on 
how posting lists are ordered. 

Recall that our compression approach maps each term $T_i \in Q$
to a set of meta-terms, denoted $M(T_i) \subseteq O$, through the
matrix $W$. At least one meta-term in $M(T_i)$ is a remainder
meta-term, denoted $Rem(T_i)$, that is uniquely mapped to $T_i$. That is,
$W[T_i, Rem(T_i)] = 1$, and $\forall j \neq i \; (W[T_j, Rem(T_i)] = 0)$. It
is straightforward to maintain a mapping $T_i \to Rem(T_i)$ during the
compression procedure. In order to add a new document $D$ that
mentions term $T_i$, it is sufficient to add a new posting to the posting
list of $Rem(T_i)$. Thus, it is possible to accommodate frequent
insertions of new documents through maintenance of remainder
meta-terms only. Note that adding postings to remainder meta-terms
does not achieve the maximum possible space saving that can
be obtained by rebuilding the compressed index from scratch. The
reason is that redundancy in newly inserted documents is ignored.
It is possible to reduce the overhead of a full index rebuild by
continuing the compression algorithm from the current matrices $W$ and
$H$ rather than starting with $W_0 = I_m$ and $H_0 = V$ (lines 1 and 2
in Algorithm 2).

Removing documents from the compressed index is achieved by 
removing all postings in the compressed index that refer to the re- 
moved document. This is equivalent to removing the entire column 
in H that corresponds to the removed document. 
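The two update operations can be sketched as follows. The sketch assumes the compressed index is a dictionary from meta-term to a `{doc_id: weight}` map and that the mapping from each term to its remainder meta-term is available as a dictionary; both layouts and the function names are hypothetical, and posting-list ordering is ignored for brevity.

```python
def add_document(index, rem, doc_id, term_weights):
    """Insert a new document by adding postings to remainder meta-terms only.

    index: {meta_term: {doc_id: weight}} (rows of H; hypothetical layout).
    rem:   {term: remainder_meta_term}, i.e. the mapping T_i -> Rem(T_i).
    Since W[T_i, Rem(T_i)] = 1, the stored weight equals the original weight.
    Terms that do not yet exist in the index are not handled here.
    """
    for term, weight in term_weights.items():
        index.setdefault(rem[term], {})[doc_id] = weight


def remove_document(index, doc_id):
    """Delete every posting of doc_id, i.e. drop the document's column of H."""
    for postings in index.values():
        postings.pop(doc_id, None)


index = {"rem_a": {1: 0.5}, "m1": {1: 1.0, 2: 1.0}}
rem = {"a": "rem_a", "b": "rem_b"}
add_document(index, rem, 3, {"a": 0.7, "b": 0.2})
remove_document(index, 1)
```

Note that the shared meta-term `m1` is never touched by an insertion, which is why redundancy among newly added documents remains unexploited until the compression algorithm is resumed.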

5. OPTIMIZING QUERY PROCESSING 

A possible side effect of our compression approach is that the
number of meta-terms in the rewritten query is larger than the
number of terms in the original query. Such an increase can be quite
significant, as we demonstrate in Section 6.2, and can lead to a
noticeable increase in query evaluation time.

In this section, we show how to mitigate this undesirable effect
by exploiting unique characteristics of our compression scheme
to improve the efficiency of typical top-k query processing
algorithms. As a case study, we show how to modify the
Non-Random-Access algorithm (NRA) [10]. Note that there exists a plethora of
search algorithms that might be more efficient than NRA, especially
for memory-resident indices. However, we chose the NRA
algorithm mainly because it is simple to describe and analyze. The
observations in this section can be exploited to adapt other search
algorithms, provided that they use similar primitives to access
posting lists.

We denote by $L_1, \ldots, L_h$ the posting lists corresponding to the
terms with non-zero weight in query $Q$, where $h = \|Q\|_0$, and let
$w_1, \ldots, w_h$ be the weights associated with $L_1, \ldots, L_h$ in $Q$. The
score of a document $D$ can be rewritten as follows:

$$ Score(D, Q) = \sum_{i=1}^{h} w_i \cdot L_i(D) \qquad (6) $$

where $L_i(D)$ denotes the weight of document $D$ in list $L_i$. The
NRA algorithm requires posting lists to be sorted in descending
order of document weights. We assume that the weights of the query
terms $w_1, \ldots, w_h$ are positive and that the number of documents in
the corpus is greater than the number of documents to retrieve ($k$).

The NRA algorithm retrieves documents from lists $L_1, \ldots, L_h$
in round-robin order. The key insight that allows early
termination is that having the lists sorted enables computing upper and
lower bounds on document scores. Every time a document
is retrieved, the lower and upper bounds of retrieved documents, as
well as of unseen documents, are updated. Once there exist $k$
documents whose lower bounds are greater than or equal to the
upper bounds of all other documents (including both seen and unseen
documents), the algorithm terminates.

Score bounds are computed in the NRA algorithm as follows.
Let $x_i$ denote the weight of the last document retrieved from list
$L_i$, if $L_i$ is not completely read by the algorithm, or 0 otherwise.
During execution of the algorithm, the score upper bound of each
retrieved document $D$, denoted $\overline{Score}(D, Q)$, is computed as
follows:

$$ \overline{Score}(D, Q) = \sum_{i=1}^{h} w_i \cdot \overline{L_i}(D) \qquad (7) $$

where $\overline{L_i}(D)$ denotes the weight of $D$ in list $L_i$, if $D$ has appeared
in $L_i$, and $x_i$ otherwise. The upper bound for unseen documents is
$\sum_{i=1}^{h} w_i \cdot x_i$. Similarly, the lower bound of each retrieved
document $D$, denoted $\underline{Score}(D, Q)$, is:

$$ \underline{Score}(D, Q) = \sum_{i=1}^{h} w_i \cdot \underline{L_i}(D) \qquad (8) $$

where $\underline{L_i}(D)$ denotes the weight of $D$ in list $L_i$, if $D$ has appeared
in $L_i$, and 0 otherwise.
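The two bounds can be computed directly from their definitions. The following Python sketch assumes that, for a given document, the weights already discovered are held in a list with `None` marking lists where the document has not yet appeared; the function name and argument layout are illustrative assumptions.

```python
def score_bounds(weights, seen, last):
    """Return (lower, upper) score bounds of one document in the NRA algorithm.

    weights[i]: query weight w_i of list L_i.
    seen[i]:    weight of the document in L_i, or None if not yet encountered there.
    last[i]:    x_i, the weight of the last posting read from L_i (0 if L_i is exhausted).
    """
    # Lower bound: undiscovered weights are assumed to be 0.
    lower = sum(w * (s if s is not None else 0.0) for w, s in zip(weights, seen))
    # Upper bound: undiscovered weights are assumed to be as large as x_i.
    upper = sum(w * (s if s is not None else x) for w, s, x in zip(weights, seen, last))
    return lower, upper


# A document seen in the first list only (illustrative numbers).
bounds = score_bounds([1.0, 2.0], [0.5, None], [0.25, 0.5])
# An unseen document has lower bound 0 and upper bound sum(w_i * x_i).
unseen = score_bounds([1.0, 2.0], [None, None], [0.25, 0.5])
```

The gap between the two bounds shrinks as more lists are read, which is what eventually allows the algorithm to terminate early.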

In the following, we describe our modifications to the NRA
algorithm. Observe that the score upper bound of each retrieved
document $D$ is computed by assuming that each undiscovered weight
$L_i(D)$ is equal to $x_i$ (Equation 7). Recall that all meta-term lists
corresponding to the same original term are disjoint (Section©. 
Thus, it is possible to compute a tighter score upper bound by set- 
ting undiscovered weight Li(D) to zero, instead of Xi, if D has 
appeared in any list Lj such that Li and Lj are disjoint. 
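The effect of the tighter bound can be seen in a small Python sketch. It assumes every meta-term list carries a label naming the single original term it belongs to, so that lists sharing a label are disjoint; meta-terms shared by several original terms are not modeled, and all names are hypothetical.

```python
def tight_upper_bound(weights, seen, last, group):
    """Upper bound that sets undiscovered weights to 0 when a disjoint sibling list hit.

    weights[i], seen[i], last[i]: as in the plain NRA bound (w_i, discovered weight
    or None, and x_i).
    group[i]: label of the original term whose meta-term list i belongs to; lists
    with the same label are assumed to be pairwise disjoint.
    """
    # Groups in which the document has already been found in some list.
    hit_groups = {g for g, s in zip(group, seen) if s is not None}
    total = 0.0
    for w, s, x, g in zip(weights, seen, last, group):
        if s is not None:
            total += w * s          # discovered weight
        elif g not in hit_groups:
            total += w * x          # still possible: bounded by x_i
        # Otherwise the document cannot appear in this list, so it contributes 0.
    return total


# Lists 0 and 1 belong to term "a"; list 2 belongs to term "b".
tight = tight_upper_bound([1.0, 1.0, 2.0], [0.5, None, None], [0.25, 0.5, 0.5], ["a", "a", "b"])
```

For the same inputs, the plain bound of Equation 7 would be 0.5 + 0.5 + 1.0 = 2.0, whereas the disjointness-aware bound drops the contribution of list 1.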

Therefore, instead of considering lists of meta-terms
independently, we create a two-level document retrieval scheme as follows.
We create a virtual list for each original query term $T_i$. Each virtual
list is traversed by probing the disjoint lists of the corresponding
meta-terms. We use a priority queue $PQ_i$ to implement the virtual
list of $T_i$.

Algorithms 3 and 4 describe how to initialize a priority queue
and how to get the next document, respectively. Let
$M(T_i) = \{\tau : W[T_i, \tau] \neq 0\}$ be the set of meta-terms that term $T_i$ is rewritten
into. Initialization of a priority queue $PQ_i$ is performed by
inserting a pair $(\tau, D)$ for each meta-term $\tau$ in $M(T_i)$, where $D$ is the
document with the highest score in $\tau$. The score of each pair $(\tau, D)$
in the queue is equal to the score of $D$ in list $\tau$ multiplied by the
weight $W[T_i, \tau]$. Retrieving the next document from the virtual list of
$T_i$ is equivalent to retrieving the document in the pair $(\tau, D)$ at the
head of the priority queue. After each retrieval from $PQ_i$, we
insert a new pair $(\tau, D')$ in $PQ_i$, where $D'$ is the next document in
$\tau$.

Modifying the NRA algorithm to use the virtual lists is
straightforward: instead of initializing posting lists, the algorithm
initializes priority queues $PQ_1, \ldots, PQ_h$ for the original query terms
$T_1, \ldots, T_h$, and retrievals from each list $L_i$ are replaced by
retrievals from the corresponding virtual list. The following
theorem proves the correctness of the modifications, and gives an upper
bound on the runtime overhead of the modified NRA algorithm.

Algorithm 3 Initialize_PQ($T_i$, $W$)
Require: $T_i$: A term in the original query $Q$
Require: $W$: The term rewriting matrix
1: $M(T_i) \leftarrow \{\tau : W[T_i, \tau] \neq 0\}$
2: Define a priority queue $PQ_i$ (initially empty)
3: for each $\tau \in M(T_i)$ do
4:    Retrieve the first document $D$ from $\tau$
5:    Insert into $PQ_i$ a pair $(\tau, D)$ with score equal to the score
      of $D$ in $\tau$ multiplied by $W[T_i, \tau]$
6: return $PQ_i$

Algorithm 4 GetNextDoc($PQ_i$)
Require: $PQ_i$: Priority queue associated with query term $T_i$
1: if $PQ_i$ is empty then
2:    return NULL
3: Remove the pair $(\tau, D)$ with score $s$ from the head of $PQ_i$
4: if $\tau$ is not exhausted then
5:    Retrieve next document $D'$ from list $\tau$
6:    Insert $(\tau, D')$ into $PQ_i$ with score equal to the score of $D'$ in
      $\tau$ multiplied by $W[T_i, \tau]$
7: return document $D$, score $s$
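The virtual-list mechanism of Algorithms 3 and 4 maps naturally onto a binary heap. The Python sketch below is one possible rendering under stated assumptions: meta-term posting lists are in-memory lists sorted by descending weight, all coefficients $W[T_i, \tau]$ are positive, and the class name and `probes` counter are illustrative additions.

```python
import heapq


class VirtualList:
    """Merge the disjoint meta-term lists of one original term into a single stream."""

    def __init__(self, meta_lists, coeffs):
        # meta_lists: {meta_term: [(doc, weight), ...]} sorted by descending weight.
        # coeffs:     {meta_term: W[T_i, meta_term]}, assumed positive.
        self.meta_lists = meta_lists
        self.coeffs = coeffs
        self.pos = {tau: 0 for tau in meta_lists}
        self.heap = []
        self.probes = 0  # number of postings read from meta-term lists
        for tau in meta_lists:  # Algorithm 3: one entry per meta-term
            self._push_next(tau)

    def _push_next(self, tau):
        i = self.pos[tau]
        postings = self.meta_lists[tau]
        if i < len(postings):  # list tau is not exhausted
            doc, weight = postings[i]
            self.pos[tau] = i + 1
            self.probes += 1
            # heapq is a min-heap, so the score is negated to pop the largest first.
            heapq.heappush(self.heap, (-weight * self.coeffs[tau], tau, doc))

    def get_next_doc(self):
        """Algorithm 4: return (doc, score) with the highest remaining score, or None."""
        if not self.heap:
            return None
        neg_score, tau, doc = heapq.heappop(self.heap)
        self._push_next(tau)
        return doc, -neg_score


meta = {"t1": [("d1", 4.0), ("d3", 1.0)], "t2": [("d2", 1.5)]}
vl = VirtualList(meta, {"t1": 1.0, "t2": 2.0})
out = []
item = vl.get_next_doc()
while item is not None:
    out.append(item)
    item = vl.get_next_doc()
```

In this toy run the stream interleaves the two meta-term lists in descending score order, which is the sequence the original posting list of the term would have produced.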

THEOREM 3. Let $P$ be the number of probes performed by the
NRA algorithm when processing query $Q$ using the original
uncompressed index $V$. Let $P'$ be the number of probes performed
by the modified NRA algorithm when processing the same query $Q$
using a compressed index $H$ and a term rewriting matrix $W$ such
that $V = WH$. The top-k results returned by both algorithms are
the same. Furthermore,

$$ P' \le P + \sum_{i=1}^{h} |M(T_i)|. $$

PROOF. First, we prove that the top-k documents returned by
the modified NRA algorithm are the same as those returned by the
unmodified NRA algorithm using the uncompressed index. Since
the modified NRA algorithm differs from the unmodified NRA
only in document retrieval, we only need to prove that the sequence
of documents retrieved from list $L_i$ in the original index is equal to
the sequence of documents retrieved from the priority queue $PQ_i$
using the compressed index through Algorithms 3 and 4.

Since $V = WH$, a row corresponding to a term $T_i$ in $V$ is equal
to a linear combination of rows $M(T_i)$ in $H$, where the coefficients
are in the row $T_i$ in $W$. That is,

$$ \forall D \in Docs, \quad V[T_i, D] = \sum_{\tau \in M(T_i)} W[T_i, \tau] \cdot H[\tau, D] \qquad (9) $$

Since all meta-terms (i.e., rows) in $H$ corresponding to the
same term in $V$ are disjoint, the value $V[T_i, D]$ can be written as
$W[T_i, \tau] \cdot H[\tau, D]$, for the unique meta-term $\tau \in M(T_i)$
satisfying $H[\tau, D] \neq 0$. Algorithms 3 and 4 reconstruct the list of $T_i$
by computing the weight of each document $D$ in the list of $T_i$ as
soon as $D$ appears in a list $\tau \in M(T_i)$. Moreover, the priority
queue returns documents in descending order of their scores. This
proves the equality of document sequences retrieved from $L_i$ and
from $PQ_i$.

Now, we prove the relationship between $P$ and $P'$. Let $depth_i$
be the number of documents retrieved from $L_i$ before termination
of the NRA algorithm when running on the uncompressed index.
We need to prove that the maximum number of documents retrieved
from meta-term lists $M(T_i)$ is $depth_i + |M(T_i)|$. The
initialization of the priority queue $PQ_i$ results in $|M(T_i)|$ probes (lines 3,
4 in Algorithm 3). Each retrieval from the priority queue results
in at most one additional probe (line 5 in Algorithm 4). Thus,
retrieving $depth_i$ documents from the priority queue requires at most
$depth_i + |M(T_i)|$ probes. The total number of probes performed
by the modified NRA algorithm is at most

$$ \sum_{i=1}^{h} \left( depth_i + |M(T_i)| \right) = P + \sum_{i=1}^{h} |M(T_i)|. $$

□
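As a quick sanity check of the bound, consider a purely hypothetical query with two terms; the depths and meta-term counts below are made-up numbers chosen only to show the arithmetic.

```latex
% Hypothetical values, for illustration only:
% depth_1 = 40, depth_2 = 25, |M(T_1)| = 3, |M(T_2)| = 5.
\begin{align*}
P  &= depth_1 + depth_2 = 40 + 25 = 65, \\
P' &\le P + |M(T_1)| + |M(T_2)| = 65 + 3 + 5 = 73.
\end{align*}
```

The overhead is therefore additive in the number of meta-terms of the query and does not grow with the depth to which the lists are read.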

Note that some meta-terms in the rewritten query might be
shared across multiple terms in the original query (i.e.,
$M(T_i) \cap M(T_j) \neq \emptyset$ for $T_i, T_j \in Q$ and $i \neq j$). It is possible to further
reduce the number of probes performed by the modified NRA by
keeping each meta-term $\tau$ in exactly one priority queue $PQ_i$ of a
term $T_i$ such that $\tau \in M(T_i)$, and removing occurrences of $\tau$ in
other priority queues. Details are omitted due to space constraints.

6. EXPERIMENTAL RESULTS 

In this section, we experimentally evaluate our approach. We 
evaluate the compression ratio for a representative document cor- 
pus and we show that the impact on query execution time is negli- 
gible. Finally, we investigate the effect of various parameters of the 
compression algorithm. 

We do not directly compare our approach to other lossless
compression techniques that compress posting lists individually
because both approaches (i.e., holistic and per-list compression) can
be integrated to achieve better overall compression. To
support this claim, we show that our compression approach is nearly
orthogonal to, and hence can be integrated with, a common
compression technique, namely variable-byte encoding.

We perform our evaluation on memory-resident indices since this
is the dominant approach in current large-scale applications. The
memory capacities of modern machines allow in-memory serving
even from large corpora such as the entire Web [7], by partitioning
the index across multiple machines. In such a setting, our technique
can be applied to each index partition individually.

6.1 Setup 

Our index factorization algorithm ran on a Hadoop cluster (see footnote 1).
Query evaluation latency was measured by a single-threaded Java
process running on an Intel Xeon 2.00GHz 8-core machine with
32GB RAM. Both compressed and uncompressed indices were
preloaded into RAM prior to query evaluation. We used the TREC
WT10g document corpus [1], which contains 1.7M documents. We
indexed only the textual content of the documents and discarded
HTML tags. We removed the least frequent terms that appear in
fewer than three documents, thus reducing the number of unique
terms from 5.4M to 1.6M. These rare terms account for less than
1% of the index size, so the effect of their removal on index
compressibility is negligible. In the indices we constructed, each
posting contains a 4-byte integer for the docID and a 4-byte integer for
the payload. For the query workload, we used 50,000 queries that were
randomly selected from the AOL query log [21].

Unless specified otherwise, we used the following default
parameter values. Block size B (i.e., the maximum number of rows
in a block) is set to 500. Sketch length A is set to 1, which means
that the row sketch is simply the number of non-zero elements in the
row. Minimum savings threshold $\mu$ is set to 100.

1 http://hadoop.apache.org

For measuring the compression ratio of our approach, we
compute the relative reduction in the space required for storing the
inverted index: (Uncompressed Index Size - Compressed Index Size)
/ (Uncompressed Index Size). We consider both matrices $W$ and $H$
when computing the size of a compressed index. Note that if
individual posting lists are not additionally compressed by any method
(e.g., var-byte encoding), the relative reduction in space is
equivalent to the reduction in the total number of non-zeros in the index, i.e.,

$$ \frac{\|V\|_0 - (\|W\|_0 + \|H\|_0)}{\|V\|_0}. $$
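The compression-ratio measure described above reduces to a one-line computation over non-zero counts. The sketch below assumes no additional per-list encoding is applied, so every non-zero costs the same amount of space; the function name and the sample counts are illustrative.

```python
def compression_ratio(nnz_v, nnz_w, nnz_h):
    """Relative reduction in stored non-zeros when V is replaced by the pair (W, H)."""
    return (nnz_v - (nnz_w + nnz_h)) / nnz_v


# Made-up counts: 1000 non-zeros originally, 10 in W and 790 in H afterwards.
ratio = compression_ratio(1000, 10, 790)
```

A negative value would indicate that the factorization takes more space than the original matrix.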

6.2 Results 

In this section, we show the results of our experiments. 

Compression Performance. We selected two compressed indices
obtained after 8 and 35 iterations of our algorithm. Figure 4 shows
the compression ratio for the two indices. We observe that after 8
iterations, our factorization algorithm compresses the index by 20%,
while applying var-byte encoding to the compressed index results
in an overall compression of 46%. At iteration 35, the compression
ratio reaches 29% and 50%, respectively. The size of matrix $W$,
which maps the original terms to the meta-terms, is less than 1% of
the compressed index size in all iterations.

By lowering the saving threshold $\mu$ to 0, our approach achieves a
compression ratio of 35% after 30 iterations (Figure 9(b)).

When limiting the number of mappers/reducers to 100 per
job, each iteration took 22 minutes on average. The runtime of the
first few iterations is slightly above average (e.g., the first iteration
took 27 minutes, while the second iteration took 23 minutes).

Query Evaluation Latency. Figure 5 shows the average query
latency for different numbers of retrieved documents ($k$) using the
compressed indices at iterations 8 and 35 (see footnote 2). We do not show the
latency of the unmodified NRA algorithm on the compressed
indices as it is orders of magnitude higher and would distort the plot.
The inefficiency of the unmodified NRA is due to the fact that the
computed score bounds are very loose, which prevents early
termination. Our modifications to the NRA algorithm eliminate the
overhead in query evaluation and result in nearly the same
performance as the unmodified NRA on the original uncompressed
index. In some cases, searching a compressed index outperforms
searching the uncompressed index (e.g., for $k = 20$, the latency
on the index compressed by 20% is 6% lower than the latency on
the uncompressed index). Thus, the optimal compression-latency
trade-off is achieved after only a few iterations.

Size of the Factor Matrices. Figure 6 depicts the relative
number of non-zero elements in $W$ and $H$ compared to the number of
non-zeros in $V$ at various iterations of the compression algorithm.
Observe the monotonicity of the curve, which is due to the property
that our algorithm never increases the number of non-zero elements in
the factors. Note that matrix $W$, which is used for query
rewriting, is much smaller than $H$ (e.g., $\|W\|_0$ is less than 1% of $\|V\|_0$
at iteration 35). We see that the first few iterations are the most
productive, while the benefit after the tenth iteration is marginal.

Integration with Variable-byte Encoding. In this experiment,
we show the behavior of variable-byte encoding [19] when applied
to the resulting compressed index. Figure 7 shows the
effectiveness of the encoding at various compression ratios of our
factorization algorithm. We observe that the two techniques complement
each other as they exploit different properties of the data. That is,
factorization-based compression has negligible effect on the
effectiveness of var-byte encoding.

2 We did not evaluate the effect of var-byte encoding on query
latency, which is studied in prior works.

Figure 4: Compression ratio (bars: Factorization; Factor. + var-byte; at 8 and 35 iterations)

Figure 5: Average query response time (bars: 0% Cmpr, original NRA; 20% Cmpr, modified NRA; 29% Cmpr, modified NRA; groups: k=10, k=20)

Figure 6: Relative reduction in non-zeros (series: nnz(H); nnz(W); nnz(W)+nnz(H); x-axis: Iterations)

Figure 7: Effectiveness of var-byte encoding (x-axis: Compression Due to Factorization)

Figure 8: Compression ratio due to factorization and var-byte (series: Factorization; Factorization + Var-byte)

Figure 9: The effect of $\mu$ on (a) the number of meta-terms, and (b) the compression ratio (x-axis: Iterations)

Figure 8 shows the compression ratio at different iterations
when using our compression method alone, and when applying
variable-byte encoding to the index generated by our factorization
algorithm. The combination of the two techniques achieves a
compression ratio of 50% at iteration 35.
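For readers unfamiliar with variable-byte encoding, the following Python sketch shows one common variant of the scheme: seven payload bits per byte, least-significant group first, with the high bit marking the last byte of a number. It is not necessarily the exact variant of [19] or the one used in the experiments, and it encodes plain non-negative integers; how docIDs or payloads are fed to the encoder is left open.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers with 7 data bits per byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)   # lower 7 bits, continuation byte (high bit clear)
            n >>= 7
        out.append(n | 0x80)       # final byte of this number (high bit set)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:               # terminating byte: finish the current number
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


values = [3, 4, 293, 1]
encoded = vbyte_encode(values)
decoded = vbyte_decode(encoded)
```

Small values occupy a single byte instead of the four bytes of a fixed-width integer, which is a property of individual entries and is unaffected by how many postings the factorization removes.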

The Saving Threshold $\mu$. In this experiment, we analyze the
effect of the savings threshold $\mu$ on the total number of meta-terms
in the compressed index (Figure 9(a)), and on the compression
ratio (Figure 9(b)). Recall that a higher $\mu$ decreases the number of
meta-terms, and thus reduces the effectiveness of the compression.
Changing $\mu$ from 0 to 100 reduces the total number of meta-terms
in the compressed index at iteration 30 from 6.5M to 1.8M. At the
same time, the compression ratio is reduced by only 6%.

Figure 10 shows the dependency between the frequency of a term
in the corpus and the number of corresponding meta-terms. Clearly,
the higher the term frequency, the longer the term's posting list is,
resulting in more meta-terms.

The Block Size B. Figure 11 shows the compression ratio for two
block sizes, 10 and 500, for $\mu = 0$. When the block size is reduced
by a factor of 50, the compression ratio falls by only 5%, while the
average iteration runtime falls from 22 to 16 minutes. Despite the
dramatic decrease in the block size, the overall runtime decreased
by only 36% due to the overhead incurred by the other tasks such as
row comparisons and updating $W$ and $H$, in addition to the
overhead incurred by the Hadoop framework.

The Sketch Length A. Figure 12 shows the compression ratio for
sketch lengths of 1 and 10 for $\mu = 0$. When combining terms,
sketch size has no effect on the compression ratio, which means
that blocking rows according to the number of non-zeros is good
enough. This is due to the relatively high variability in row lengths
(number of non-zeros), which is known to follow a power-law
distribution. This variability provides sufficient information to
identify "similar" rows.
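Blocking by row length alone can be sketched very simply. The Python fragment below is a deliberate simplification, assuming that rows are sorted by their number of non-zeros and cut into consecutive blocks of at most B rows; the actual blocking procedure of the algorithm may differ, and the function name and data layout are hypothetical.

```python
def block_rows_by_length(rows, block_size):
    """Group rows with similar numbers of non-zeros into blocks of bounded size.

    rows: {row_id: {doc_id: weight}} (hypothetical layout); the length-1 sketch of
    a row is taken to be its number of non-zero entries.
    """
    # Sort by row length, breaking ties by row id to keep the result deterministic.
    ordered = sorted(rows, key=lambda r: (len(rows[r]), r))
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]


rows = {
    "a": {1: 1, 2: 1, 3: 1},
    "b": {1: 1},
    "c": {1: 1, 2: 1, 4: 1},
    "d": {5: 1},
}
blocks = block_rows_by_length(rows, block_size=2)
```

Rows of equal length end up in the same block, where candidate pairs can then be compared exhaustively.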

To investigate the potential effect of sketch length, we modified
our algorithm to combine columns of matrix $H$ instead of its rows
(although this is not a viable option for index compression, as
explained in Section 3). Columns have much less variability in their
lengths (number of non-zeros), since the distribution of document
length is closer to normal than to power-law. In this case,
blocking by column length alone is not effective, and more fine-grained
similarity metrics (e.g., longer sketches) give better results (22% at
iteration 7 compared to 18%).

7. RELATED WORK 

Lossless compression of inverted indices has been an active topic
for the past few years. Most of the developed techniques (e.g.,
variable-byte encoding, gamma-coding and delta-coding [19], [22])
aim at generating an efficient encoding of the entries in posting lists,
and thus can be integrated with our approach (cf. Section 6).

There are multiple techniques for lossy compression. One of the 
widely used techniques is static pruning [5] ED- Techniques that 
are based on static pruning truncate postings that have low impact
on the results of top-k queries. The simplest form is to remove
postings with payload values less than a specific cut-off
threshold. Clearly, lossy compression might lead to degradation in result
quality, unlike lossless compression, where the quality of results is not
affected. In general, lossy compression can be integrated with
lossless compression techniques in order to achieve higher compression
ratios at the cost of lowering the quality of query results.
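The simplest form of static pruning mentioned above amounts to a filter over each posting list. The Python sketch below assumes postings are `(doc_id, payload)` tuples and uses a hypothetical function name; it illustrates the cut-off idea only, not the specific methods of the cited works.

```python
def static_prune(postings, cutoff):
    """Drop postings whose payload is below the cut-off threshold (lossy)."""
    return [(doc, payload) for doc, payload in postings if payload >= cutoff]


pruned = static_prune([(1, 0.9), (2, 0.1), (3, 0.5)], cutoff=0.5)
```

Unlike the factorization approach, the dropped postings cannot be recovered, so scores computed on the pruned index may differ from those on the original.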

Several matrix factorization approaches have been proposed,
such as Non-negative Matrix Factorization [13], [16], [17], Principal
Component Analysis [14], K-means clustering [9], Latent Semantic
Analysis [8], and Singular Value Decomposition [23]. The goal
of these techniques is to factor a given matrix into two (or three)
factor matrices that (optionally) exhibit some level of sparseness.
Such techniques provide a close approximation of the input matrix,
while our approach provides an exact factorization of the input
matrix. Modifying NMF algorithms to be lossless is not
straightforward. For example, one naive approach is to compute the
remainder matrix $R = V - WH$ so that the matrix $V$ can be compactly
represented using the matrices $W$, $H$, and $R$ (i.e., $V = WH + R$).
Unfortunately, there is no guarantee that sparseness of $W$ and $H$
would lead to sparseness of $R$. In fact, the size of $R$ can be larger
than the size of $V$ because elements in $V$ with values equal to zero
may have non-zero values in the product $WH$.
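A tiny numeric example shows how the remainder matrix can become denser than the input. The factors below are hand-picked for illustration (a rank-1 approximation of a 2x2 identity matrix) and are not the output of any particular NMF algorithm; the helper names are assumptions.

```python
def matmul(A, B):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def nnz(M):
    """Number of non-zero entries in a dense matrix."""
    return sum(1 for row in M for v in row if v != 0)


V = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0], [1.0]]          # hand-picked factors, not computed by NMF
H = [[0.5, 0.5]]
WH = matmul(W, H)
# Remainder needed to make the representation exact: V = WH + R.
R = [[v - p for v, p in zip(v_row, p_row)] for v_row, p_row in zip(V, WH)]
```

Here $V$ has two non-zeros, while $R$ has four, because the zero entries of $V$ are non-zero in $WH$ and must be cancelled by $R$.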

Another line of related work in a different context, namely signal
and image processing, considers the problem of representing a signal
(vector) using a linear combination of a small number of basis
vectors from a dictionary (e.g., [2], [24]). The problem of selecting the
optimal dictionary given the set of signals is similar to the problem
we consider, with two major differences: (1) the dimensions of the
factor matrices are selected in advance, and (2) the sparseness is
required only from the encoding vectors (matrix $W$) and not from
the basis vectors (matrix $H$).

Another related problem is discovering biclusters in
two-dimensional data (refer to [18] for a comprehensive survey). The
goal is to discover the largest bicluster (submatrix) that exhibits
certain characteristics (e.g., have the same value, or follow
additive/multiplicative patterns). Computing the largest bicluster is
shown to be NP-hard [18]. Our approach can be viewed as a
biclustering problem (however, with a different goal) as follows. Each
meta-term $\tau$ represents a bicluster in $V$ whose rows are multiples
of each other and contain non-zero values only. Each bicluster
results in a number of non-zero elements in matrix $W$ (respectively,
$H$) that is equal to the number of rows (respectively, columns) of
the bicluster. The size of a bicluster is defined as the total
number of the contained rows and columns. Our goal is to obtain a
set of disjoint biclusters that covers all non-zero elements of the
input matrix such that the total size of biclusters is minimal.
Unfortunately, previous approaches for discovering biclusters cannot
be easily extended to address the described objective function.

8. CONCLUSION 

In this paper, we presented a novel approach for compressing
inverted indices without information loss, based on exact
factorization of sparse matrices. We proved that obtaining the optimal
factorization is NP-hard, and developed an efficient greedy
factorization algorithm. We described how to modify a typical top-k
search algorithm to eliminate the computational overhead at
retrieval time by exploiting characteristics of our compression
scheme. Our experimental evaluation shows that our technique
achieves a compression ratio of 35% while incurring a negligible
increase in query evaluation time. We also showed that at a
compression ratio of 20%, the query response time is not affected by
compression. Other lossless compression approaches such as
variable-byte encoding can be integrated with our approach to
achieve overall compression ratios up to 50%.